ILP Modulo Data
The vast quantity of data generated and captured every day has led to a
pressing need for tools and processes to organize, analyze and interrelate this
data. Automated reasoning and optimization tools with inherent support for data
could enable advancements in a variety of contexts, from data-backed decision
making to data-intensive scientific research. To this end, we introduce a
decidable logic aimed at database analysis. Our logic extends quantifier-free
Linear Integer Arithmetic with operators from Relational Algebra, like
selection and cross product. We provide a scalable decision procedure that is
based on the BC(T) architecture for ILP Modulo Theories. Our decision procedure
makes use of database techniques. We also experimentally evaluate our approach,
and discuss potential applications.
Comment: FMCAD 2014 final version plus proof
Any-k: Anytime Top-k Tree Pattern Retrieval in Labeled Graphs
Many problems in areas as diverse as recommendation systems, social network
analysis, semantic search, and distributed root cause analysis can be modeled
as pattern search on labeled graphs (also called "heterogeneous information
networks" or HINs). Given a large graph and a query pattern with node and edge
label constraints, a fundamental challenge is to find the top-k matches
according to a ranking function over edge and node weights. For users, it is
difficult to select a value for k. We therefore propose the novel notion of an any-k
ranking algorithm: for a given time budget, return as many of the top-ranked
results as possible. Then, given additional time, produce the next lower-ranked
results quickly as well. It can be stopped anytime, but may have to continue
until all results are returned. This paper focuses on acyclic patterns over
arbitrary labeled graphs. We are interested in practical algorithms that
effectively exploit (1) properties of heterogeneous networks, in particular
selective constraints on labels, and (2) that the users often explore only a
fraction of the top-ranked results. Our solution, KARPET, carefully integrates
aggressive pruning that leverages the acyclic nature of the query, and
incremental guided search. It enables us to prove strong non-trivial time and
space guarantees, which is generally considered very hard for this type of
graph search problem. Through experimental studies we show that KARPET achieves
running times in the order of milliseconds for tree patterns on large networks
with millions of nodes and edges.
Comment: To appear in WWW 201
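The any-k idea described above can be illustrated with a lazy, ranked generator. The sketch below is not KARPET: it is a plain best-first search over a path pattern (a special case of a tree pattern), it assumes non-negative edge weights where a lower total weight means a better rank, and the names `ranked_path_matches`, `nodes`, `edges`, and `pattern` are hypothetical.

```python
import heapq
from itertools import count, islice


def ranked_path_matches(nodes, edges, pattern):
    """Lazily yield (total_weight, path) for a path pattern, best match first.

    nodes:   {node_id: label}
    edges:   {node_id: [(neighbor_id, weight), ...]}, weights must be >= 0
    pattern: list of node labels, one per position of the path
    """
    tie = count()  # tie-breaker so the heap never compares paths
    heap = [(0, next(tie), (v,)) for v, lab in nodes.items() if lab == pattern[0]]
    heapq.heapify(heap)
    while heap:
        weight, _, path = heapq.heappop(heap)
        if len(path) == len(pattern):
            # With non-negative weights, every pending partial match costs
            # at least `weight`, so this match is the next in rank order.
            yield weight, path
            continue
        wanted = pattern[len(path)]
        for nxt, w in edges.get(path[-1], ()):
            if nodes[nxt] == wanted:
                heapq.heappush(heap, (weight + w, next(tie), path + (nxt,)))


# Hypothetical toy graph: authors, papers, and one venue.
nodes = {"a1": "Author", "a2": "Author", "p1": "Paper", "p2": "Paper", "v1": "Venue"}
edges = {"a1": [("p1", 1), ("p2", 4)], "a2": [("p1", 2)],
         "p1": [("v1", 3)], "p2": [("v1", 1)]}
matches = ranked_path_matches(nodes, edges, ["Author", "Paper", "Venue"])
first_two = list(islice(matches, 2))  # stop early, like a small time budget
the_rest = list(matches)              # resume later for lower-ranked results
```

Because the generator keeps its search frontier between calls, the caller never has to fix k in advance: it can take a few results, stop, and resume later.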
Principles of Query Visualization
Query Visualization (QV) is the problem of transforming a given query into a
graphical representation that helps humans understand its meaning. This task is
notably different from designing a Visual Query Language (VQL) that helps a
user compose a query. This article discusses the principles of relational query
visualization and its potential for simplifying user interactions with
relational data.
Comment: 20 pages, 12 figures, preprint for IEEE Data Engineering Bulletin
Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries
We study the question of when we can provide logarithmic-time direct access
to the k-th answer to a Conjunctive Query (CQ) with a specified ordering over
the answers, following a preprocessing step that constructs a data structure in
time quasilinear in the size of the database. Specifically, we embark on the
challenge of identifying the tractable answer orderings that allow for ranked
direct access with such complexity guarantees. We begin with lexicographic
orderings and give a decidable characterization (under conventional complexity
assumptions) of the class of tractable lexicographic orderings for every CQ
without self-joins. We then continue to the more general orderings by the sum
of attribute weights and show for it that ranked direct access is tractable
only in trivial cases. Hence, to better understand the computational challenge
at hand, we consider the more modest task of providing access to only a single
answer (i.e., finding the answer at a given position) - a task that we refer to
as the selection problem. We indeed achieve a quasilinear-time algorithm for a
subset of the class of full CQs without self-joins, by adopting a solution of
Frederickson and Johnson to the classic problem of selection over sorted
matrices. We further prove that none of the other queries in this class admit
such an algorithm.
Comment: 17 pages
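To make the selection problem concrete, consider the simplest sum-of-weights case: the answers are all pairs (x, y) from two weighted lists, ordered by x + y, which is selection over a sorted matrix. The sketch below is not the Frederickson and Johnson algorithm; it is a simpler heap-based method that takes roughly O((n + k) log n) time, shown only to illustrate the task. The function name `kth_smallest_pair_sum` and the 0-based position `k` are assumptions made here.

```python
import heapq


def kth_smallest_pair_sum(xs, ys, k):
    """Return the sum at 0-based position k among all x + y, in sorted order."""
    xs, ys = sorted(xs), sorted(ys)
    if not 0 <= k < len(xs) * len(ys):
        raise IndexError("k is out of range")
    # Row i of the implicit matrix holds xs[i] + ys[0], xs[i] + ys[1], ...
    # and is sorted, so the heap only needs the smallest unseen cell per row.
    heap = [(x + ys[0], i, 0) for i, x in enumerate(xs)]
    heapq.heapify(heap)
    for _ in range(k):
        _, i, j = heapq.heappop(heap)
        if j + 1 < len(ys):
            heapq.heappush(heap, (xs[i] + ys[j + 1], i, j + 1))
    return heap[0][0]


median_sum = kth_smallest_pair_sum([1, 4, 7], [2, 3, 10], 4)
```

The matrix of sums is never materialized; only one candidate per row is held at a time.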
Near-Optimal Distributed Band-Joins through Recursive Partitioning
We consider running-time optimization for band-joins in a distributed system,
e.g., the cloud. To balance load across worker machines, input has to be
partitioned, which causes duplication. We explore how to resolve this tension
between maximum load per worker and input duplication for band-joins between
two relations. Previous work suffered from high optimization cost or considered
partitionings that were too restricted (resulting in suboptimal join
performance). Our main insight is that recursive partitioning of the
join-attribute space with the appropriate split scoring measure can achieve
both low optimization cost and low join cost. It is the first approach that is
not only effective for one-dimensional band-joins but also for joins on
multiple attributes. Experiments indicate that our method is able to find
partitionings that are within 10% of the lower bound for both maximum load per
worker and input duplication for a broad range of settings, significantly
improving over previous work.
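The tension between load and duplication can be seen in a minimal one-dimensional sketch. It assumes a band-join condition |a - b| <= eps on numeric join attributes and splits at the median of the distinct S values; that median rule is a stand-in chosen here for illustration, not the split scoring measure of the paper. All names (`band_join`, `recursive_partition`, `max_load`) are hypothetical.

```python
from bisect import bisect_left, bisect_right


def band_join(r, s, eps):
    """All pairs (a, b) with a in r, b in s and |a - b| <= eps."""
    s_sorted = sorted(s)
    out = []
    for a in r:
        lo = bisect_left(s_sorted, a - eps)
        hi = bisect_right(s_sorted, a + eps)
        out.extend((a, b) for b in s_sorted[lo:hi])
    return out


def recursive_partition(r, s, eps, max_load):
    """Recursively split the join-attribute range until each part is small.

    S is split disjointly; an R value within eps of the split point is
    copied to both sides, which is the source of input duplication.
    """
    distinct = sorted(set(s))
    if len(r) + len(s) <= max_load or len(distinct) < 2:
        return [(r, s)]
    split = distinct[len(distinct) // 2]  # assumed rule: median distinct value
    s_left = [b for b in s if b < split]
    s_right = [b for b in s if b >= split]
    r_left = [a for a in r if a - eps < split]    # may match some b < split
    r_right = [a for a in r if a + eps >= split]  # may match some b >= split
    return (recursive_partition(r_left, s_left, eps, max_load)
            + recursive_partition(r_right, s_right, eps, max_load))


r, s, eps = [1, 5, 9, 12, 20], [2, 6, 7, 13, 30], 2
parts = recursive_partition(r, s, eps, max_load=4)
local_results = [pair for rp, sp in parts for pair in band_join(rp, sp, eps)]
duplicated = sum(len(rp) for rp, _ in parts) - len(r)  # extra copies of R values
```

Each worker would join one `(r_part, s_part)` pair locally. Since S is partitioned without overlap, every output pair is produced exactly once.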
The model-summary problem and a solution for trees
Modern science is collecting massive amounts of data from sensors, instruments, and through computer simulation. It is widely believed that analysis of this data will hold the key for future scientific breakthroughs. Unfortunately, deriving knowledge from large high-dimensional scientific datasets is difficult. One emerging answer is exploratory analysis using data mining; but data mining models that accurately capture natural processes tend to be very complex and are usually not intelligible. Scientists therefore generate model summaries to find the most important patterns learned by the model. We formalize the model-summary problem and introduce it as a novel problem to the database community. Generating model summaries creates serious data management challenges: Scientists usually want to analyze patterns in different “slices” and “dices” of the data space, comparing the effects of various input variables on the output. We propose novel techniques for efficiently generating such summaries for the popular class of tree-based models. Our techniques leverage workload structure on multiple levels. We also propose a scalable implementation of our techniques in MapReduce. For both sequential and parallel implementation, we achieve speedups of one or more orders of magnitude over the naive algorithm, while guaranteeing the exact same results.
Optimal Algorithms for Ranked Enumeration of Answers to Full Conjunctive Queries
We study ranked enumeration of join-query results according to very general
orders defined by selective dioids. Our main contribution is a framework for
ranked enumeration over a class of dynamic programming problems that
generalizes seemingly different problems that had been studied in isolation. To
this end, we extend classic algorithms that find the k-shortest paths in a
weighted graph. For full conjunctive queries, including cyclic ones, our
approach is optimal in terms of the time to return the top result and the delay
between results. These optimality properties are derived for the widely used
notion of data complexity, which treats query size as a constant. By performing
a careful cost analysis, we are able to uncover a previously unknown tradeoff
between two incomparable enumeration approaches: one has lower complexity when
the number of returned results is small, the other when the number is very
large. We theoretically and empirically demonstrate the superiority of our
techniques over batch algorithms, which produce the full result and then sort
it. Our technique is not only faster for returning the first few results, but
on some inputs beats the batch algorithm even when all results are produced.Comment: 50 pages, 19 figure